INN Hotels Group operates a chain of hotels in Portugal and is facing
problems with a high number of booking cancellations. In the hotel
industry, effective management of bookings is crucial for optimizing
revenue and ensuring operational efficiency.
Our main objective in this project is to analyze the dataset to find
which factors have a high influence on booking cancellations while
answering some SMART questions. By examining customer attributes such as
room preferences, lead time, meal plan choices, and past booking
behavior, we will identify key patterns and trends. We will also
analyze other variables such as special requests, arrival month, and
market segments.
The dataset used in this project, sourced from Kaggle, contains detailed information on hotel bookings. With around 36,000 records from city and resort hotels, it provides insight into the factors contributing to hotel booking cancellations. The dataset includes variables such as lead time (the number of days between booking and arrival), room type, market segment, average daily rate (ADR), and special requests. These variables offer a robust foundation for analyzing cancellation trends and customer behavior.
A key strength of this dataset is its comprehensive scope. It includes not only basic booking details but also behavioral indicators like special requests, allowing for a deeper exploration of how different types of customers behave. This enables us to evaluate factors beyond the usual booking characteristics, such as how the length of stay or lead time influences cancellation likelihood.
It’s worth keeping in mind several limitations that could influence the results of the analysis. First, this dataset lacks demographic data like specific age groups, gender, or income, which are often critical in understanding customer behavior patterns. Without this information, it becomes more difficult to explore how demographic factors could correlate with cancellations. The dataset also only contains bookings from 2017 to 2018, which limits the ability to understand longer-term trends or assess the impact of other external factors that might have fluctuated across different years, such as economic upturns/downturns, the Covid-19 pandemic, or general industry shifts.
The dataset source does not explicitly state how the data was gathered, but it was likely obtained from a “hotel reservation system” or HRS. This system is designed to automate the booking process, manage room availability, and handle customer data, including booking status, lead time, room type, special requests, and much more. The data is typically stored in the hotel’s “Property Management System” or PMS, which centralizes information from various booking channels such as the hotel’s website, third-party platforms (like Booking.com), and over-the-phone bookings. The PMS ensures that reservation data is synchronized across platforms and stored securely for retrieval during check-in and check-out.(1) Through these systems, hotels reduce human error and increase efficiency, although missing data or occasional discrepancies may still occur due to system errors, system limitations, or other external factors.
Prior research into hotel booking cancellations has consistently highlighted the importance of lead time (the period between when a booking is made and the actual check-in date) as a key predictor of cancellations. Studies show that customers who book far in advance tend to cancel more often, likely because they have more time to change their plans. This trend has been confirmed in other hospitality datasets, where longer lead times are correlated with a higher likelihood of cancellations.(2) Additionally, market segmentation plays a role in understanding booking behaviors. Corporate clients and group bookings generally exhibit lower cancellation rates than individual leisure travelers. This is likely due to the stricter nature of corporate travel policies and pre-arranged group contracts, which offer less flexibility for last-minute cancellations. On the other hand, bookings made through online travel agents (OTAs) tend to have higher cancellation rates than direct bookings.(2)
Research into hotel booking patterns significantly influenced the development of the research questions for this project. Based on our prior findings, the hypothesis was that lead time would be a strong predictor of cancellations, as customers who book earlier have more opportunities to cancel. The decision to focus on variables such as special requests and ADR was similarly guided by studies that suggested these variables would also influence cancellation behavior. Additionally, our assumptions concerning seasonality informed our decision to analyze cancellation patterns over different times of the year.
To improve the current analysis, the inclusion of demographic data such as customer age, gender, and income would offer a more thorough understanding of customer behavior, allowing for better segmentation and tailored strategies to prevent cancellations. Additionally, detailed information about cancellation policies, specifically whether bookings were refundable or non-refundable, would provide crucial insights into how financial incentives influence cancellation decisions. This data could help differentiate between voluntary cancellations and those prompted by policy restrictions. Lastly, data on external factors, such as travel restrictions, promotions, or special events that occurred during the booking period, would provide important context, helping to explain possible spikes in cancellations or booking behaviors that may otherwise seem anomalous. This broader dataset would allow for a more comprehensive analysis.
1. Do customers who make special requests cancel
their bookings less frequently than those who don’t?
2. Are guests with no previous cancellations more
likely to avoid canceling their current booking?
3. How does booking status vary with meal plan,
room type, lead time, and average room price?
4. What are the key factors that show the strongest
correlation with booking cancellations?
5. How do hotel cancellation rates change across
different seasons (spring, fall, winter, summer, holidays, low season),
and which factors (e.g., lead time, room type) correlate most strongly
with cancellations during each season?
For this project, we selected the INNHotelsGroup dataset from Kaggle, a hotel booking dataset containing over 36,000 booking records. It offers a comprehensive overview of hotel booking patterns, making it well suited to our analysis of cancellations and no-shows.
hotel_data <- read.csv("../Dataset/INNHotelsGroup.csv")
This dataset contains 36275 observations of 19 variables. Out of these observations, 0 rows contain null values.
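The missing-value count above can be checked in more than one way. The sketch below is a minimal illustration on a hypothetical toy data frame (`toy` is not the real file). It also counts empty strings, on the assumption that `read.csv` was called with default arguments, in which case blank text fields arrive as `""` rather than `NA`.

```r
# Hypothetical toy data standing in for hotel_data (not the real file)
toy <- data.frame(
  Booking_ID = c("INN00001", "INN00002", "INN00003", "INN00004"),
  lead_time = c(224L, NA, 1L, 211L),
  type_of_meal_plan = c("Meal Plan 1", "Not Selected", "", "Meal Plan 1"),
  stringsAsFactors = FALSE
)

# Rows with at least one NA in any column
rows_with_na <- sum(!complete.cases(toy))

# Rows with at least one empty string in a character column
is_blank <- vapply(
  toy,
  function(col) is.character(col) & !is.na(col) & col == "",
  logical(nrow(toy))
)
rows_with_blank <- sum(rowSums(is_blank) > 0)

rows_with_na
rows_with_blank
```

Running both checks on the real data would show whether "0 rows contain null values" also holds for blank text fields.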
str(hotel_data)
## 'data.frame': 36275 obs. of 19 variables:
## $ Booking_ID : chr "INN00001" "INN00002" "INN00003" "INN00004" ...
## $ no_of_adults : int 2 2 1 2 2 2 2 2 3 2 ...
## $ no_of_children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ no_of_weekend_nights : int 1 2 2 0 1 0 1 1 0 0 ...
## $ no_of_week_nights : int 2 3 1 2 1 2 3 3 4 5 ...
## $ type_of_meal_plan : chr "Meal Plan 1" "Not Selected" "Meal Plan 1" "Meal Plan 1" ...
## $ required_car_parking_space : int 0 0 0 0 0 0 0 0 0 0 ...
## $ room_type_reserved : chr "Room_Type 1" "Room_Type 1" "Room_Type 1" "Room_Type 1" ...
## $ lead_time : int 224 5 1 211 48 346 34 83 121 44 ...
## $ arrival_year : int 2017 2018 2018 2018 2018 2018 2017 2018 2018 2018 ...
## $ arrival_month : int 10 11 2 5 4 9 10 12 7 10 ...
## $ arrival_date : int 2 6 28 20 11 13 15 26 6 18 ...
## $ market_segment_type : chr "Offline" "Online" "Online" "Online" ...
## $ repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ no_of_previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ no_of_previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ avg_price_per_room : num 65 106.7 60 100 94.5 ...
## $ no_of_special_requests : int 0 1 0 0 0 1 1 1 1 3 ...
## $ booking_status : chr "Not_Canceled" "Not_Canceled" "Canceled" "Canceled" ...
Data Dictionary
For our project, we are interested in the majority of the variables, but some columns are irrelevant to our objective, so we drop them.
hotel_data_clean <- hotel_data[, c("no_of_adults", "no_of_children", "no_of_weekend_nights", "no_of_week_nights", "type_of_meal_plan", "required_car_parking_space", "room_type_reserved", "lead_time", "arrival_year", "arrival_month", "market_segment_type", "repeated_guest", "no_of_previous_cancellations", "no_of_special_requests", "booking_status",
"avg_price_per_room")]
hotel_data_clean <- na.omit(hotel_data_clean)
After cleaning, we are left with 36275 observations of 16 variables.
write.csv(hotel_data_clean, "../Dataset/INNHotelsGroup_min.csv", row.names = FALSE)
hotel_data_clean$type_of_meal_plan <- as.factor(hotel_data_clean$type_of_meal_plan)
hotel_data_clean$room_type_reserved <- as.factor(hotel_data_clean$room_type_reserved)
hotel_data_clean$booking_status <- as.factor(hotel_data_clean$booking_status)
hotel_data_clean$market_segment_type <- as.factor(hotel_data_clean$market_segment_type)
hotel_data_clean$arrival_year <- as.factor(hotel_data_clean$arrival_year)
hotel_data_clean$arrival_month <- as.factor(hotel_data_clean$arrival_month)
summary(hotel_data_clean)
## no_of_adults no_of_children no_of_weekend_nights no_of_week_nights
## Min. :0.00 Min. : 0.00 Min. :0.00 Min. : 0.0
## 1st Qu.:2.00 1st Qu.: 0.00 1st Qu.:0.00 1st Qu.: 1.0
## Median :2.00 Median : 0.00 Median :1.00 Median : 2.0
## Mean :1.84 Mean : 0.11 Mean :0.81 Mean : 2.2
## 3rd Qu.:2.00 3rd Qu.: 0.00 3rd Qu.:2.00 3rd Qu.: 3.0
## Max. :4.00 Max. :10.00 Max. :7.00 Max. :17.0
##
## type_of_meal_plan required_car_parking_space room_type_reserved
## Meal Plan 1 :27835 Min. :0.000 Room_Type 1:28130
## Meal Plan 2 : 3305 1st Qu.:0.000 Room_Type 2: 692
## Meal Plan 3 : 5 Median :0.000 Room_Type 3: 7
## Not Selected: 5130 Mean :0.031 Room_Type 4: 6057
## 3rd Qu.:0.000 Room_Type 5: 265
## Max. :1.000 Room_Type 6: 966
## Room_Type 7: 158
## lead_time arrival_year arrival_month market_segment_type
## Min. : 0 2017: 6514 10 : 5317 Aviation : 125
## 1st Qu.: 17 2018:29761 9 : 4611 Complementary: 391
## Median : 57 8 : 3813 Corporate : 2017
## Mean : 85 6 : 3203 Offline :10528
## 3rd Qu.:126 12 : 3021 Online :23214
## Max. :443 11 : 2980
## (Other):13330
## repeated_guest no_of_previous_cancellations no_of_special_requests
## Min. :0.000 Min. : 0.00 Min. :0.00
## 1st Qu.:0.000 1st Qu.: 0.00 1st Qu.:0.00
## Median :0.000 Median : 0.00 Median :0.00
## Mean :0.026 Mean : 0.02 Mean :0.62
## 3rd Qu.:0.000 3rd Qu.: 0.00 3rd Qu.:1.00
## Max. :1.000 Max. :13.00 Max. :5.00
##
## booking_status avg_price_per_room
## Canceled :11885 Min. : 0
## Not_Canceled:24390 1st Qu.: 80
## Median : 99
## Mean :103
## 3rd Qu.:120
## Max. :540
##
Summary:
The average price per room in the dataset is 103 euros, with a median of
99 euros, but prices can reach up to 540 euros, indicating high-priced
outliers. Some entries even show an average price of zero, possibly
reflecting promotional deals. Guests typically stay for two weekday
nights and one weekend night, with the average number of weekday nights
being 2.2 and weekend nights 0.81. Stays can extend to as many as 17
weekday nights. Most bookings involve two adults, and many guests do not
bring children. Lead times vary significantly, with an average of 85
days, a median of 57, and some bookings made up to 443 days in advance,
suggesting a right-skewed distribution. The data also shows sparse
previous cancellations, with an average of just 0.02 and a maximum of
13. Bookings are spread over 2017 and 2018, peaking in October.
Additionally, most guests do not make special requests, as the median is
zero, although some make up to five requests per booking.
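The zero-price entries mentioned above could be examined by cross-tabulating them against market segment. The sketch below uses a small hypothetical data frame (`toy`, with made-up values) rather than the real bookings, so the counts it produces say nothing about the actual data.

```r
# Hypothetical toy data; values are invented for illustration only
toy <- data.frame(
  avg_price_per_room = c(0, 0, 65, 106.7, 0, 94.5),
  market_segment_type = c("Complementary", "Complementary", "Offline",
                          "Online", "Online", "Online")
)

# Flag bookings with a zero average price
zero_price <- toy$avg_price_per_room == 0

# Count the zero-price bookings per market segment
zero_by_segment <- table(toy$market_segment_type[zero_price])
zero_by_segment
```

If most zero-price rows in the real data fall in one segment, that would point to a structural explanation rather than promotional pricing.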
print(ggplot(hotel_data_clean, aes(x = booking_status)) +
geom_bar(fill = "lightblue") +
labs(title = "Booking Status Distribution", x = "Booking Status", y = "Count"))
Summary
The dataset contains 36,275 bookings, with 24,390 classified as “Not
Canceled” and 11,885 as “Canceled”. Approximately two-thirds of the
bookings were completed, while the remaining one-third were
canceled.
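The shares quoted above follow directly from the two counts. A minimal sketch of the arithmetic, using only the counts reported in this section (the variable names are ours):

```r
# Counts taken from the booking status summary in this report
status_counts <- c(Canceled = 11885, Not_Canceled = 24390)

# Convert counts to percentage shares of all bookings
status_share <- round(100 * prop.table(status_counts), 1)
status_share
```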
print(ggplot(hotel_data_clean, aes(x = type_of_meal_plan, fill = booking_status)) +
geom_bar(position = "fill") +
labs(title = "Cancellation Rate by Meal Plan", x = "Meal Plan", y = "Proportion", fill = "Booking Status"))
Summary
Meal Plan 2 shows the highest cancellation rate among the well-populated plans, with cancellations and non-cancellations being nearly equal. Meal Plan 1 accounts for 27,835 of the 36,275 bookings, so its cancellation rate necessarily sits near the overall rate of about one-third; it cannot exceed 11,885 / 27,835 ≈ 42.7%. Meal Plan 3 shows the highest proportion of non-canceled bookings, but it contains only 5 bookings, so this should not be over-interpreted. The “Not Selected” category has a cancellation rate similar to Meal Plan 1.
print(ggplot(hotel_data_clean, aes(x = room_type_reserved, fill = booking_status)) +
geom_bar(position = "fill") +
labs(title = "Cancellation Rate by Room Type", x = "Room Type", y = "Proportion", fill = "Booking Status"))
Summary
Room Type 6 has the highest cancellation rate, with a large proportion
of canceled bookings. Room Type 7 stands out as having the highest
proportion of non-canceled bookings, suggesting it may be a preferred
or better-secured type of room. For the other room types, the rates are
fairly similar, with around 60-70% of bookings not canceled, except for
Room Type 1, which has more cancellations.
print(ggplot(hotel_data_clean, aes(x = factor(no_of_special_requests))) +
geom_bar(fill = "lightblue") +
labs(title = "Number of Special Requests", x = "Number of Special Requests", y = "Count") +
theme_minimal())
Summary
Most guests make no special requests, with a sharp decline as the number
increases. A significant portion makes one request, while two or more
requests are increasingly rare.
contingency_table <- table(hotel_data_clean$market_segment_type, hotel_data_clean$booking_status)
plot_data <- as.data.frame(contingency_table)
colnames(plot_data) <- c("Market_Segment", "Booking_Status", "Count")
print(ggplot(plot_data, aes(x = Market_Segment, y = Count, fill = Booking_Status)) +
geom_bar(stat = "identity", position = "dodge") +
theme_minimal() +
labs(title = "Booking Status by Market Segment",
x = "Market Segment",
y = "Count",
fill = "Booking Status") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)))
Summary
The Online market segment has the highest cancellation count, with
8475 canceled bookings. The Offline segment has a cancellation count of
3253. For the other market segments, the cancellation counts are very
low.
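Raw counts are driven largely by segment size, so a rate per segment is easier to compare. The sketch below uses only figures quoted in this report: the cancellation counts from the summary above and the segment totals from the `summary()` output. The variable names are ours.

```r
# Canceled bookings per segment, as quoted in the summary above
canceled <- c(Online = 8475, Offline = 3253)

# Total bookings per segment, from the summary() output
total <- c(Online = 23214, Offline = 10528)

# Cancellation rate per segment, in percent
cancel_rate <- round(100 * canceled / total, 1)
cancel_rate
```

On these figures the Online rate is higher than the Offline rate, but the gap is much smaller than the difference in counts suggests.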
ggplot(hotel_data_clean, aes(x = lead_time, fill = booking_status)) +
geom_histogram(binwidth = 10, position = "dodge") +
labs(
title = "Lead Time vs Booking Status",
x = "Lead Time (Days)",
y = "Number of Bookings",
fill = "Booking Status"
) +
theme_minimal()
Summary
Here we can see that bookings with short lead times are canceled less
often, and that cancellations become more common as lead time
increases. Cancellations are especially prominent for lead times
between 100 and 200 days.
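A dodged histogram shows counts, so a cancellation rate per lead-time band can make the same pattern easier to read. The following is a minimal sketch on a hypothetical toy data frame. The bin edges (0, 50, 100, 200) are our own choice for illustration, not something taken from the analysis.

```r
# Hypothetical toy data; values are invented for illustration only
toy <- data.frame(
  lead_time = c(5, 20, 48, 83, 121, 150, 211, 224, 346, 12),
  booking_status = c("Not_Canceled", "Not_Canceled", "Not_Canceled", "Canceled",
                     "Not_Canceled", "Canceled", "Canceled", "Canceled",
                     "Canceled", "Canceled")
)

# Assign each booking to a lead-time band (left-closed intervals)
toy$lead_bin <- cut(toy$lead_time,
                    breaks = c(0, 50, 100, 200, Inf),
                    include.lowest = TRUE, right = FALSE)

# Share of canceled bookings within each band
rate_by_bin <- tapply(toy$booking_status == "Canceled", toy$lead_bin, mean)
rate_by_bin
```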
ggplot(hotel_data_clean, aes(x = booking_status, y = avg_price_per_room, fill = booking_status)) +
geom_boxplot() +
labs(
title = "Average Room Price vs Booking Status",
x = "Booking Status",
y = "Average Room Price"
) +
theme_minimal() +
theme(legend.position = "none")
Summary
Canceled bookings have a higher median average room price than
non-canceled bookings. This might mean that bookings with a higher
average room price are more likely to be canceled.
Null Hypothesis (H₀): There is no significant
association between special requests and booking status.
Alternative Hypothesis (H₁): There is a significant
association between special requests and booking status.
hotel_data_clean <- hotel_data_clean %>%
mutate(has_special_requests = ifelse(no_of_special_requests > 0, "Yes", "No"),
is_cancelled = ifelse(booking_status == "Canceled", "Canceled", "Not_Canceled"))
special_requests_table <- table(hotel_data_clean$has_special_requests, hotel_data_clean$is_cancelled)
sum(special_requests_table) > 0
## [1] TRUE
special_requests_test <- chisq.test(special_requests_table)
special_requests_props <- prop.table(special_requests_table, margin = 1)
Contingency Table:
print(special_requests_table)
##
## Canceled Not_Canceled
## No 8545 11232
## Yes 3340 13158
Chi-Square Test Results:
print(special_requests_test)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: special_requests_table
## X-squared = 2152, df = 1, p-value <2e-16
print(ggplot(hotel_data_clean, aes(x = has_special_requests, fill = is_cancelled)) +
geom_bar(position = "fill") +
labs(title = "Cancellation Rates by Special Requests",
x = "Has Special Requests", y = "Proportion") +
theme_minimal())
Summary:
The chi-square test gives a very small p-value, far below the typical
significance level of 0.05. We therefore reject the null hypothesis (H₀)
in favor of the alternative hypothesis (H₁).
Bookings with special requests have a much lower cancellation rate (20.2%) than those without (43.2%). The data strongly suggests that customers who make special requests are more likely to follow through with their bookings.
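The rates and the test statistic can be reproduced from the printed contingency table alone. The sketch below rebuilds the table from those four counts and adds a phi coefficient as a rough effect size. The phi calculation is our addition, not part of the original analysis, and it uses the uncorrected chi-square statistic.

```r
# Rebuild the 2x2 table from the counts printed above (filled column by column)
special_requests_counts <- matrix(
  c(8545, 3340, 11232, 13158),
  nrow = 2,
  dimnames = list(has_special_requests = c("No", "Yes"),
                  status = c("Canceled", "Not_Canceled"))
)

# Row-wise proportions give the cancellation rate per group
row_rates <- prop.table(special_requests_counts, margin = 1)

# Same test as in the report (Yates' correction is the default for 2x2 tables)
sr_test <- chisq.test(special_requests_counts)

# Phi coefficient from the uncorrected statistic (our addition)
x2_uncorrected <- chisq.test(special_requests_counts, correct = FALSE)$statistic
phi <- sqrt(unname(x2_uncorrected) / sum(special_requests_counts))

round(row_rates, 3)
sr_test
phi
```

With a sample this large, almost any difference is statistically significant, so an effect size is a useful companion to the p-value.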
Null Hypothesis (H₀): There is no association
between having previous cancellations and the likelihood of canceling
the current booking.
Alternative Hypothesis (H₁): Guests with no previous
cancellations are less likely to cancel their current booking.
hotel_data_clean <- hotel_data_clean %>%
mutate(has_previous_cancellations = ifelse(no_of_previous_cancellations > 0, "Yes", "No"))
previous_cancellations_table <- table(hotel_data_clean$has_previous_cancellations, hotel_data_clean$is_cancelled)
previous_cancellations_test <- chisq.test(previous_cancellations_table)
previous_cancellations_props <- prop.table(previous_cancellations_table, margin = 1)
Contingency Table:
print(previous_cancellations_table)
##
## Canceled Not_Canceled
## No 11869 24068
## Yes 16 322
Chi-Square Test Results:
print(previous_cancellations_test)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: previous_cancellations_table
## X-squared = 120, df = 1, p-value <2e-16
print(ggplot(hotel_data_clean, aes(x = has_previous_cancellations, fill = is_cancelled)) +
geom_bar(position = "fill") +
labs(title = "Cancellation Rates by Previous Cancellations",
x = "Has Previous Cancellations", y = "Proportion") +
theme_minimal())
Summary:
Because the p-value is below the significance threshold of 0.05, we
reject the null hypothesis of no association. The direction of the
effect, however, is the opposite of what the alternative hypothesis
proposed: guests with previous cancellations are less likely to cancel
their current booking.
Bookings from guests with previous cancellations have a much lower cancellation rate (4.73%) than those without previous cancellations (33.03%), a difference of about 28.3 percentage points. Only 338 bookings fall in the first group, so this estimate rests on a small share of the data.
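Since one group is small, a confidence interval for the difference in cancellation rates helps show how precisely it is estimated. The sketch below applies `prop.test` to the counts from the contingency table above. Using `prop.test` here is our suggestion, not a step in the original analysis.

```r
# Canceled bookings and total bookings per group, from the table above
canceled <- c(Yes = 16, No = 11869)
bookings <- c(Yes = 16 + 322, No = 11869 + 24068)

# Two-sample test of equal proportions, with a CI for the difference
prev_test <- prop.test(canceled, bookings)

prev_test$estimate   # cancellation rate in each group
prev_test$conf.int   # 95% CI for (rate with prior cancellations) - (rate without)
```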
Null Hypothesis (H₀): There is no difference in mean lead time
between canceled and non-canceled bookings.
Alternative Hypothesis (H₁): There is a difference in mean lead
time between canceled and non-canceled bookings.
hotel_data_clean$canceled <- ifelse(hotel_data_clean$booking_status == "Canceled", 1, 0)
# Subsetting the data for customers who canceled
df_canceled <- subset(hotel_data_clean, canceled == 1)
# Subsetting the data for customers who did not cancel
df_not_canceled <- subset(hotel_data_clean, canceled == 0)
ttestlead_time <- t.test(df_canceled$lead_time, df_not_canceled$lead_time,
alternative = "two.sided", conf.level = 0.95)
ttestlead_time
##
## Welch Two Sample t-test
##
## data: df_canceled$lead_time and df_not_canceled$lead_time
## t = 81, df = 16886, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 78.3 82.2
## sample estimates:
## mean of x mean of y
## 139.2 58.9
Summary:
The t-test analysis reveals a significant difference in lead times
between canceled and non-canceled bookings. On average, bookings that
were canceled had a lead time of 139.2 days, while non-canceled bookings
had a much shorter average lead time of 58.9 days. This difference is
highly statistically significant (p-value < 0.05), suggesting that
guests with longer lead times are more prone to cancel their
reservations.
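Lead time is right-skewed, so a rank-based test is a reasonable sensitivity check alongside the Welch t-test. The sketch below runs both on simulated data. The exponential draws, the sample size of 500, and the seed are all assumptions made for illustration; only the two group means are borrowed from the output above.

```r
# Simulated, right-skewed lead times (NOT the real data)
set.seed(42)
lead_canceled     <- rexp(500, rate = 1 / 139)  # mean near 139 days
lead_not_canceled <- rexp(500, rate = 1 / 59)   # mean near 59 days

# Welch t-test, as used in the report
welch <- t.test(lead_canceled, lead_not_canceled)

# Wilcoxon rank-sum test, which does not rely on the means being well behaved
rank_based <- wilcox.test(lead_canceled, lead_not_canceled)

welch$p.value
rank_based$p.value
```

If both tests agree on the real data, the conclusion does not depend on the skew of the distribution.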
Null Hypothesis (H₀): There is no significant difference in
average room price between canceled and non-canceled bookings.
Alternative Hypothesis (H₁): There is a significant difference
in average room price between canceled and non-canceled bookings.
# Creating a binary variable for cancellation
hotel_data_clean$canceled <- ifelse(hotel_data_clean$booking_status == "Canceled", 1, 0)
# Subsetting the data for customers who canceled
df_canceled <- subset(hotel_data_clean, canceled == 1)
# Subsetting the data for customers who did not cancel
df_not_canceled <- subset(hotel_data_clean, canceled == 0)
# Performing a t-test to compare average price per room between those who canceled and those who didn't
ttest_avg_price_per_room <- t.test(
df_canceled$avg_price_per_room,
df_not_canceled$avg_price_per_room,
alternative = "two.sided",
conf.level = 0.95
)
print(ttest_avg_price_per_room)
##
## Welch Two Sample t-test
##
## data: df_canceled$avg_price_per_room and df_not_canceled$avg_price_per_room
## t = 28, df = 25929, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 9.92 11.39
## sample estimates:
## mean of x mean of y
## 110.6 99.9
Summary:
The extremely small p-value indicates a statistically significant
difference between the average room prices for canceled and non-canceled
bookings. Canceled bookings have a higher mean room price: 110.6 euros
versus 99.9 euros, a difference of approximately 10.7 euros (95%
confidence interval 9.92 to 11.39).
Null Hypothesis (H₀): There is no significant difference in
cancellation rates based on the type of meal plan.
Alternative Hypothesis (H₁): There is a significant difference
in cancellation rates based on the type of meal plan.
hotel_data_clean$type_of_meal_plan <- as.factor(hotel_data_clean$type_of_meal_plan)
hotel_data_clean$booking_status <- as.factor(hotel_data_clean$booking_status)
contingency_table <- table(hotel_data_clean$type_of_meal_plan, hotel_data_clean$booking_status)
chi_test_result <- chisq.test(contingency_table)
print(chi_test_result)
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 278, df = 3, p-value <2e-16
Summary:
The Chi-squared test indicates a significant association between meal
plan type and booking status. The p-value is far below the
standard threshold (0.05), meaning the differences observed between meal
plans in terms of cancellation rates are highly unlikely to be due to
chance. Because Meal Plan 1 makes up 27,835 of the 36,275 bookings, its
cancellation rate stays close to the overall rate of about one-third, so
the association likely comes mainly from how the smaller categories
depart from it.
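The chi-squared approximation is usually considered reliable only when expected cell counts are not too small (a common rule of thumb is at least 5). Expected counts depend only on the row and column totals, so they can be sketched from the `summary()` output in this report without the full table. The variable names and the threshold of 5 are ours.

```r
# Row totals (bookings per meal plan) and column totals, from summary()
meal_totals <- c("Meal Plan 1" = 27835, "Meal Plan 2" = 3305,
                 "Meal Plan 3" = 5, "Not Selected" = 5130)
status_totals <- c(Canceled = 11885, Not_Canceled = 24390)

# Expected count under independence: row total * column total / grand total
expected <- outer(meal_totals, status_totals) / sum(meal_totals)

# Cells below the rule-of-thumb threshold of 5
low_cells <- expected < 5

round(expected, 2)
low_cells
```

Both Meal Plan 3 cells fall below the threshold. Common responses are to merge or drop that category, or to call `chisq.test` with `simulate.p.value = TRUE`.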
Null Hypothesis (H₀): There is no significant difference in
cancellation rates based on the room type.
Alternative Hypothesis (H₁): There is a significant difference
in cancellation rates based on the room type.
hotel_data_clean$room_type_reserved <- as.factor(hotel_data_clean$room_type_reserved)
#df$booking_status <- as.factor(df$booking_status)
# Create a contingency table
contingency_table <- table(hotel_data_clean$room_type_reserved, hotel_data_clean$booking_status)
# Perform the Chi-squared test
chi_test_result <- chisq.test(contingency_table)
# Display the result of the test
print(chi_test_result)
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 57, df = 6, p-value = 2e-10
Summary:
The statistical analysis shows a significant relationship between the
room type and the likelihood of booking cancellation. This insight
suggests that certain room types are more prone to cancellations than
others. As seen in the EDA, Room Type 6 has the highest
cancellation rate, so it is the most prone to cancellation.
# Convert booking_status to a binary variable
hotel_data_clean$booking_status_binary <- ifelse(hotel_data_clean$booking_status == "Canceled", 1, 0)
# Function to remove columns with zero variance
remove_zero_variance <- function(df) {
df[, sapply(df, function(col) sd(col, na.rm = TRUE) != 0)]
}
# Remove columns with zero variance
data_filtered <- remove_zero_variance(select_if(hotel_data_clean, is.numeric))
# Calculate the correlation matrix
cor_data <- cor(data_filtered, use = "complete.obs")
# Visualize the correlation matrix
corrplot(cor_data, method = "color", addCoef.col = "black",
title = "Correlation Matrix for Entire Dataset", number.cex = 1,
tl.cex = 0.8, mar = c(1, 1, 2, 1))
Summary:
The correlation matrix for the entire dataset reveals several key
insights regarding factors associated with booking cancellations. Lead
time shows the strongest positive correlation with cancellations (0.44),
indicating that bookings made further in advance are more likely to be
canceled. This makes sense, since customers have more time to change
their plans or reconsider their bookings when the time between
reservation and stay is longer. The number of special requests has a
negative correlation (-0.25) with cancellations, implying that customers
who make more personalized arrangements, such as requesting specific
room types or amenities, are generally more committed to their bookings
and less likely to cancel. The average price per room shows a weak
positive correlation (0.14), suggesting that higher-priced bookings are
slightly more prone to cancellation, although this relationship is not
very strong. It’s possible that the higher financial commitment
associated with more expensive rooms leads some customers to reconsider
their bookings. Meanwhile, being a repeated guest has a small negative
correlation (-0.11) with cancellations, indicating that loyal customers
are slightly less likely to cancel their bookings. Interestingly,
factors such as the number of adults, children, weekend nights, and week
nights show little to no correlation with cancellations, suggesting that
the composition of the travel party and the length of stay do not
significantly impact the likelihood of a booking being canceled. In
summary, lead time stands out as the strongest correlate of
cancellations, while engagement indicators such as the number of special
requests and being a repeated guest are associated with fewer
cancellations. Although price plays a role, it is not a major
determinant of cancellation behavior.
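Reading the strongest correlates off a full matrix plot is error-prone, so it can help to extract the target column and sort it by absolute value. The sketch below does this on simulated data. The data-generating formula, the seed, and the sample size are assumptions made purely for illustration, and the resulting coefficients say nothing about the real dataset.

```r
# Simulated toy data (NOT the real bookings)
set.seed(1)
n <- 500
toy <- data.frame(
  lead_time = rexp(n, rate = 1 / 85),
  no_of_special_requests = rpois(n, 0.6),
  avg_price_per_room = rnorm(n, mean = 103, sd = 30)
)

# Hypothetical outcome loosely tied to lead time and special requests
toy$booking_status_binary <- rbinom(
  n, 1, plogis(-1 + 0.01 * toy$lead_time - 0.8 * toy$no_of_special_requests)
)

# Correlation of every other column with the target, ranked by magnitude
cor_mat <- cor(toy)
target <- "booking_status_binary"
cor_with_target <- cor_mat[setdiff(colnames(cor_mat), target), target]
ranked <- cor_with_target[order(abs(cor_with_target), decreasing = TRUE)]

round(ranked, 2)
```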
In addition to analyzing the overall correlations in the dataset, we decided to perform a seasonality correlation analysis to explore how booking behavior might vary across different times of the year. Since factors like lead time, pricing, and cancellation rates can fluctuate with seasonal trends, understanding how these relationships change across spring, summer, fall, and winter could provide deeper insights into customer behavior and help tailor strategies to minimize cancellations based on seasonal patterns.
# Define seasons based on arrival month
hotel_data_clean$season <- case_when(
hotel_data_clean$arrival_month %in% c(3, 4, 5) ~ "Spring",
hotel_data_clean$arrival_month %in% c(6, 7, 8) ~ "Summer",
hotel_data_clean$arrival_month %in% c(9, 10, 11) ~ "Fall",
hotel_data_clean$arrival_month %in% c(12, 1, 2) ~ "Winter",
TRUE ~ "Unknown"
)
# Convert booking_status to a binary variable
hotel_data_clean$booking_status_binary <- ifelse(hotel_data_clean$booking_status == "Canceled", 1, 0)
# Create subsets for each season
spring_data <- subset(hotel_data_clean, season == "Spring")
summer_data <- subset(hotel_data_clean, season == "Summer")
fall_data <- subset(hotel_data_clean, season == "Fall")
winter_data <- subset(hotel_data_clean, season == "Winter")
# Function to remove columns with zero variance
remove_zero_variance <- function(df) {
df[, sapply(df, function(col) sd(col, na.rm = TRUE) != 0)]
}
# Remove zero variance columns for each season
spring_data_filtered <- remove_zero_variance(select_if(spring_data, is.numeric))
summer_data_filtered <- remove_zero_variance(select_if(summer_data, is.numeric))
fall_data_filtered <- remove_zero_variance(select_if(fall_data, is.numeric))
winter_data_filtered <- remove_zero_variance(select_if(winter_data, is.numeric))
# Calculate and visualize correlation for Spring
cor_spring <- cor(spring_data_filtered, use = "complete.obs")
corrplot(cor_spring, method = "color", addCoef.col = "black",
title = "Correlation Matrix for Spring", number.cex = 1,
tl.cex = 0.8, mar = c(1, 1, 2, 1))
# Calculate and visualize correlation for Summer
cor_summer <- cor(summer_data_filtered, use = "complete.obs")
corrplot(cor_summer, method = "color", addCoef.col = "black",
title = "Correlation Matrix for Summer", number.cex = 1,
tl.cex = 0.8, mar = c(1, 1, 2, 1))
# Calculate and visualize correlation for Fall
cor_fall <- cor(fall_data_filtered, use = "complete.obs")
corrplot(cor_fall, method = "color", addCoef.col = "black",
title = "Correlation Matrix for Fall", number.cex = 1,
tl.cex = 0.8, mar = c(1, 1, 2, 1))
# Calculate and visualize correlation for Winter
cor_winter <- cor(winter_data_filtered, use = "complete.obs")
corrplot(cor_winter, method = "color", addCoef.col = "black",
title = "Correlation Matrix for Winter", number.cex = 1,
tl.cex = 0.8, mar = c(1, 1, 2, 1))
Fall Correlation Matrix:
The strongest positive correlation with booking status (0.54) is lead
time, suggesting that as the lead time increases, there is a higher
chance of the booking being canceled. This makes sense, as bookings made
far in advance may have a higher likelihood of being reconsidered or
cancelled.
Number of special requests has a slight negative correlation with
booking status (-0.23), indicating that bookings with more special
requests tend to have a lower likelihood of cancellation.
Summer Correlation Matrix:
Lead time continues to show a strong positive correlation (0.43) with
booking status, compared to other variables, meaning that longer lead
times are associated with cancellations during the summer as well.
Number of special requests shows a moderate negative correlation
(-0.30), reinforcing the idea that bookings with special requests are
less likely to be cancelled.
Average price per room has a weak positive correlation (0.22),
suggesting that higher-priced rooms might be slightly more likely to be
cancelled in the summer.
Spring Correlation Matrix:
The positive correlation with lead time is weaker in spring, at
0.29, but points in the same direction as in the other seasons. Longer
lead times are associated with a higher likelihood of cancellation.
Special requests have the strongest negative correlation (-0.36) in the
spring, indicating that bookings with more special requests are much
less likely to be canceled during this season.
Average price per room shows a small positive correlation (0.13) with
booking status, indicating a slight tendency for higher-priced bookings
to be cancelled.
Winter Correlation Matrix:
Lead time remains a positive trending factor, with a correlation of
0.24. Longer lead times during the winter continue to be associated with
higher cancellation rates.
The number of special requests has a negative correlation (-0.14),
suggesting that special requests still reduce the likelihood of
cancellations in the winter, although the effect is less pronounced
compared to other seasons.
Average price per room has the strongest positive correlation in
winter (0.36), exceeding both lead time (0.24) and the price
correlations in the other seasons. Higher-priced rooms may have a higher
chance of being cancelled during the winter season.
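The seasonal comparison above can also be laid out as one compact table, with one column per season and one row per predictor. The sketch below shows one way to build it on simulated data. The data, the seed, and the helper name `season_of` are assumptions; the month-to-season mapping matches the one used in this report.

```r
# Simulated toy data (NOT the real bookings)
set.seed(7)
n <- 400
toy <- data.frame(
  arrival_month = sample(1:12, n, replace = TRUE),
  lead_time = rexp(n, rate = 1 / 85),
  no_of_special_requests = rpois(n, 0.6),
  avg_price_per_room = rnorm(n, mean = 103, sd = 30),
  booking_status_binary = rbinom(n, 1, 0.33)
)

# Lookup vector indexed by month number (same season mapping as in the report)
season_of <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
               "Summer", "Summer", "Fall", "Fall", "Fall", "Winter")
toy$season <- season_of[toy$arrival_month]

# Correlation of each predictor with cancellation, computed within each season
predictors <- c("lead_time", "no_of_special_requests", "avg_price_per_room")
season_cor <- sapply(split(toy, toy$season), function(d) {
  sapply(predictors, function(v) cor(d[[v]], d$booking_status_binary))
})

round(season_cor, 2)
```

Because the toy outcome is generated independently of the predictors, the values here hover near zero. The point is the layout, which makes cross-season differences visible in one place.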
Our initial research question aimed to identify the factors that most significantly influence hotel booking cancellations. After conducting EDA, it became clear that some factors, such as lead time, special requests, and meal plan type, are far more strongly associated with cancellations than others. The analysis also revealed that seasonality interacts with these variables: the correlation between lead time and cancellation ranges from 0.24 in winter to 0.54 in fall. As a result, the question evolved to focus more specifically on the interaction between these high-impact factors.
Based on our analysis, we can begin to outline an answer to the research question. Lead time emerged as the most important factor overall, with longer lead times correlated with cancellations in every season and most strongly in fall and summer. Special requests are associated with a lower likelihood of cancellation, suggesting that these customers are more committed to their bookings. These findings indicate that focusing on lead time management and offering incentives to secure early-booking customers could be key strategies to reduce cancellations.